Abstract: Mixed-criticality systems, in which multiple tasks of
varying criticality are executed on a single hardware platform, are
an emerging research area in real-time embedded systems.
High-criticality tasks require spatial and temporal isolation guarantees,
whereas low-criticality tasks should efficiently utilize hardware
resources. Hardware-based isolation is desirable, but commonly
underutilizes hardware resources, which can consist of multiple
single-core, multicore, or multithreaded processors. We present
FlexPRET, a processor designed specifically for mixed-criticality
systems by allowing each task to make a trade-off between
hardware-based isolation and efficient processor utilization.
FlexPRET uses fine-grained multithreading with flexible scheduling
and timing instructions to provide this functionality.

I.  Introduction 

A current trend in real-time embedded systems, driven
by size, weight, and power concerns, is consolidating many
increasingly complex applications onto fewer hardware platforms.
A processor must then execute tasks with differing
importance, safety, or certification requirements, creating a
mixed-criticality system [1], [2], [3]. These requirements are
often specified using criticality levels, such as the five levels
(A-E) used in the DO-178C avionics standard [4].
Higher-criticality-level tasks correspond to hard real-time tasks,
whereas lower-criticality-level tasks correspond to soft real-time
tasks. Each criticality level has different requirements.

Spatial and temporal isolation prevent a task from being
unintentionally affected by another task: spatial isolation
protects a task’s state (stored in registers and memory), and
temporal isolation protects a task’s desired timing behavior.
Timing predictability, still an issue even if a task is isolated, is
desirable to tightly bound a task’s worst-case execution time
(WCET) and avoid over-provisioning resources. These are
desirable properties for verification and certification of a hard
real-time task. For a soft real-time task, however, efficiently
utilizing the processor is more important than isolation and
predictability guarantees.

Hardware-based isolation is robust and often used in
safety-critical systems [5], [6], but traditionally utilizes hardware
resources inefficiently [7], as each processor can only execute
a single task. As a consequence, extensive research has been
performed during the past several years on software scheduling
of mixed-criticality systems [1], [8], [9], [2], [10], [11], where
isolation is provided by software running on a real-time
operating system (RTOS). Although this can drastically reduce
hardware costs, the RTOS itself must be verified and certified.
This includes accounting for overhead and behavior of run-time
monitoring and control, which could include preemption
to switch tasks, monitor execution times, or handle sensing
and actuation [12]. In this paper, we take a different approach
and investigate if a new processor architecture can provide
hardware-based isolation to mixed-criticality systems without
underutilizing resources.

Hardware-based isolation can be achieved by deploying
each task on separate computational components: either
processors, cores in a multicore processor, or hardware threads
in a multithreaded processor. Although potentially better than
one task per processor, one task per core can still underutilize
resources. A hardware thread, subsequently referred to as just
a thread, uses hardware support to share the underlying processor
with other threads. Provided thread scheduling maintains
temporal isolation, which is not done in many multithreaded
processors, one task per thread can better utilize resources by
enabling multiple tasks to be deployed on a single processor
without losing hardware-based isolation.

One approach is to use fine-grained multithreading
(interleaved multithreading, barrel processor) [13], [14], where
instructions from different threads are interleaved in the
pipeline every cycle. Existing fine-grained multithreaded
processors have, however, inflexible thread scheduling mechanisms.
PTARM [15] provides scheduling guarantees to each
thread but cannot utilize cycles when a thread is no longer
active, whereas XMOS XS1 [16] loses temporal isolation if it
utilizes cycles when a thread is not active. Both processors also
require four threads to be active to fully utilize the pipeline.

This  paper  presents  FlexPRET,  a  processor  designed  to 
exhibit  architectural  techniques  useful  for  mixed-criticality 
systems.  FlexPRET  uses  fine-grained  multithreading,  and  by 
classifying  each  thread  as  either  a  hard  real-time  thread 
(HRTT)  or  a  soft  real-time  thread  (SRTT),  provides  hardware- 
based  isolation  to  HRTTs  while  allowing  SRTTs  to  efficiently 
utilize  the  processor.  Unlike  other  fine-grained  multithreaded 
processors,  FlexPRET  supports  an  arbitrary  interleaving  of 
threads  in  the  pipeline  and  uses  a  novel  thread  scheduler. 
Each  thread,  either  an  HRTT  or  SRTT,  can  be  guaranteed 
to  be  scheduled  at  certain  cycles  for  isolation  or  throughput 
guarantees.  If  no  thread  is  scheduled  for  a  cycle  or  a  scheduled 
thread  has  completed  its  task,  that  cycle  is  used  by  some  SRTT 
in  a  round-robin  fashion — efficiently  utilizing  the  processor. 

Even with spatial and temporal isolation, a task’s WCET
can be difficult to bound; the predictability and performance
trade-off of the underlying processor depends on architectural
decisions, typically regarding branch prediction, pipeline
ordering, and caches. Current research in scheduling
mixed-criticality systems assumes that WCET bounds are either
unsafe [9] (not actually a bound) or overly pessimistic [1], [2].
The problem is that many processors lack fine-grained
predictability, meaning the ability to determine the execution time of
an instruction with little knowledge of execution history,
a desirable property for WCET analysis. Complex branch
prediction and multilevel caches, used to optimize average-case
performance, do not have fine-grained predictability and make
WCET analysis a non-trivial task [17]. FlexPRET is related
to previous precision-timed (PRET) machines [18], [15], [19],
where fine-grained multithreading allows processors to remove
dynamic branch prediction and caches for fine-grained
predictability with less of a performance penalty.

FlexPRET also uses a variation of timing
instructions [20], [18], [19] to provide direct control over timing
(in nanoseconds). With these instructions, the processor itself
is responsible for executing with the timing specified by the
instruction, enabling a more direct temporal mapping between
languages with timing semantics and the processor. Unlike
previous work, cycles are not left unused to satisfy temporal
constraints; FlexPRET reallocates these cycles to SRTTs.

Specific  contributions  of  this  paper  include: 

• A novel processor design exhibiting architectural
techniques targeted for mixed-criticality systems,
providing hardware-based isolation to hard real-time threads
while allowing soft real-time threads to efficiently
utilize processor resources (Sections III-B, III-C, III-D).

• Timing instructions, extending the RISC-V ISA [21],
that enable cycles to be reallocated to other threads
when not needed to satisfy a temporal constraint
(Section III-D).

• A concrete soft-core FPGA implementation,
evaluated for resource usage, and execution of two
mixed-criticality task sets to demonstrate properties and a
possible scheduling methodology (Section IV).

II.  Motivating  Example 

In this section, we use a simple mixed-criticality example
to demonstrate some useful properties of FlexPRET. Although
in this example each task is deployed on a separate thread, in
practice, software-based scheduling can be used to deploy
multiple tasks on a single thread; the deployment of a large
mixed-criticality task set on FlexPRET is shown in Section IV-D.

Consider a mixed-criticality system that consists of three
independent, periodic tasks, τ_A, τ_B, and τ_C, each mapped
to its own thread. The task identifiers A, B, C correspond to
criticality levels from highest to lowest, motivated by criticality
levels used in the DO-178C avionics standard, with τ_A and
τ_B hard real-time tasks that must be verified and certified to
always meet their deadline and τ_C a soft real-time task that
requires less-strict guarantees. Each task τ_i has a deadline
equal to its period T_i. FlexPRET’s thread scheduler ensures
hard real-time tasks are executed at a constant rate for isolation
and predictability; when a cycle is not used for a hard real-time
task, which includes when tasks finish early, that cycle is
used by a soft real-time task.

In Figure 1, a potential execution trace for a single hyperperiod
(t = 10,000 processor cycles is the least common multiple
of task periods) is shown, where rectangles indicate a task is
executing. The upper plot is unusual in that the units of both
axes are processor cycles, but this is done to more clearly show
which task’s thread is executed each cycle; the vertical axis
indicates which threads are executed in a particular pipeline
stage each cycle over a four-cycle interval. The sequence of
thread interleaving is read top to bottom, left to right, as shown
for two intervals by the lower plot.

Task τ_B executes once every four cycles and is the first
to complete (t = 2,000). Its cycles are not needed until its
next period (t = 5,000) and are donated to τ_C. Task τ_A,
requiring more cycles than τ_B to meet its deadline, executes
once every two cycles and also donates cycles to τ_C when
it completes (t = 8,000). Notice that τ_B does not donate
cycles to τ_A; as a hard real-time task, τ_A would be verified to
meet its deadline with only its allocated cycles and does not
benefit by completing earlier. Tasks τ_A and τ_B are temporally
isolated by only being executed at prescribed cycles, which enables
independent verification. Task τ_C efficiently uses every cycle
not needed by τ_A or τ_B, but has sacrificed temporal isolation;
its timing behavior depends on when τ_A and τ_B start and end.

Fig. 1: FlexPRET executing a simple mixed-criticality example.
Vertical direction shows which threads are executed each
processor cycle over a four-cycle interval.


III.  FlexPRET  Design 

FlexPRET  is  a  32-bit,  5-stage,  fine-grained  multithreaded 
processor  with  software-controlled,  flexible  thread  scheduling. 
It  uses  a  classical  RISC  5-stage  pipeline:  instruction  fetch  (F), 
decode  (D),  execute  (E),  memory  access  (M),  and  writeback 
(W). Predict-not-taken branching and software-controlled local
memories are used for fine-grained predictability. It also
implements  the  RISC-V  ISA  [21],  an  ISA  designed  to  support 
computer  architecture  research,  that  we  extended  to  include 
timing  instructions. 

A.  Background 

Fine-grained multithreading is the ability to switch between
different hardware threads on each clock cycle, allowing
instructions from multiple hardware threads to be interleaved
in the pipeline. Each hardware thread, subsequently referred
to as just thread, maintains its own state: general-purpose
registers, program counter, and other control registers. The
thread scheduler decides from which thread to fetch an
instruction each cycle and will be discussed in Section III-C. A
pipeline hazard occurs when executing a particular instruction
in the next clock cycle could cause incorrect execution and
can be prevented by stalling (waiting) or flushing (aborting)
this instruction, which wastes cycles. If multiple threads are
interleaved in the pipeline, the previous instruction has progressed
further through the pipeline when the next instruction from
that thread is fetched, reducing or eliminating cycles that are
wasted to prevent hazards by increasing the spacing between
dependent instructions. Such interleaving increases overall
processor throughput (total number of instructions processed
on all threads), but increases the latency (total processor cycles
between start and finish) of computing a task, compared to if
the tasks were executed on a single-threaded processor.

Example 1: Consider a single-threaded processor executing
a branch instruction that should be taken. This particular
processor does not calculate the branch decision and target
address until the end of the execute stage, so two cycles (2
and 3) are wasted (flushed) if a branch is taken.

TID  Addr.  Inst.     Cycle
                      1  2  3  4  5  6  7  8
0    0x00   BR 0x0C   F  D  E  M  W
0    0x04   I            F  D  -  -  -
0    0x08   I               F  -  -  -  -
0    0x0C   I                  F  D  E  M  W

The  thread  ID  (TID)  column  shows  that  each  cycle  an 
instruction  is  fetched  from  the  same  thread  (0).  The  instruction 
and  address  columns  show  example  instructions  and  their 
address in memory, where BR 0x0C means branch to address
0x0C, and I is an arbitrary instruction. Dashes indicate an
instruction  was  flushed  (instructions  at  0x04  and  0x08). 

Example 2: Now consider the same program running on a
fine-grained multithreaded processor sharing the pipeline with
three other threads in a round-robin fashion. The thread is
not scheduled again until after the branch decision and target
address are calculated, so no cycles are wasted, but the thread
has a larger latency.

In single-threaded processors, switching to a different task
and maintaining spatial isolation involves a context switch,
saving the state of one task and restoring the state of another,
a time-consuming operation performed entirely by software
unless the processor provides hardware support. If each task
is assigned to a different thread, a fine-grained multithreading
processor is capable of context switching every clock cycle. In
addition to reduced overhead when switching between tasks,
this also allows low-latency reactions to external I/O; a task
can start reacting within a few cycles instead of waiting for an
RTOS to context switch.

B.  Pipeline 

FlexPRET allows an arbitrary interleaving of threads (see
footnote 1) in the pipeline (i.e., no restrictions on the schedule),
giving the thread scheduler flexibility to utilize the processor well.
Unfortunately, this also means the pipeline is susceptible to
data and control hazards, which can occur when the spacing
between two instructions from the same thread is too close. For
example, the thread scheduler could schedule only one thread
to be executed in the pipeline, and two cycles would need to
be flushed when a branch is taken (as occurred in Example 1).

Footnote 1: The physical number of threads is a hardware decision;
we support 1-8 threads.

As in a typical single-threaded RISC pipeline, FlexPRET
avoids most data hazards by forwarding, which supplies
required data from later pipeline stages to avoid waiting until it
is written back to the register file. The only difference is that
thread IDs must also be compared so that forwarding only
occurs between instructions from the same thread. There are
still hazards that cannot always be avoided with forwarding
because the required data is not yet computed, such as a data
hazard with a memory load or a control hazard with a jump or
taken branch. Unlike in a typical single-threaded RISC pipeline,
stalling and flushing must be performed carefully so as not to
disrupt the schedule, which would reduce temporal isolation.
Stalling is done by replaying the instruction in the thread’s
next scheduled slot, and flushing (decision made by the execute
stage) is only done on instructions in the fetch or decode stage
with the same thread ID.

The  spacing  required  between  two  particular  instructions 
from  the  same  thread  to  prevent  hazards  depends  on  both  the 
ISA  and  how  it  is  implemented.  For  FlexPRET,  if  a  jump  or 
branch  occurs,  the  subsequent  two  processor  cycles  must  not 
execute  an  instruction  from  that  thread,  which  is  achieved  by 
the  flush  operation  just  described.  Memory  loads  and  stores 
occur  in  a  single  processor  cycle,  but  in  the  pipeline  stage 
after  the  execute  stage;  if  the  execute  stage  needs  the  result  of 
a  memory  read  (e.g.  to  perform  an  arithmetic  operation),  these 
instructions  must  not  be  scheduled  next  to  each  other.  Even 
though  the  number  of  scheduled  processor  cycles  required  to 
execute  a  sequence  of  instructions  varies  with  scheduling,  this 
number  is  still  predictable — it  can  be  exactly  computed  for 
any  sequence  of  instructions  if  the  scheduling  is  known. 

Example 3: Consider FlexPRET executing a schedule that
alternates between two threads. Only one instruction (following
the branch instruction) needs to be flushed when a thread
branches, and forwarding allows the ADD instruction to use
the result of the LD instruction without stalling.

Most fine-grained multithreaded processors, such as the
UltraSPARC T1 [14], XMOS XS1 [16], and PTARM [15], do
not support an arbitrary interleaving of threads. They require
sufficient spacing between instructions from the same thread so
as not to require forwarding, stalling, or flushing, saving the area
cost of these mechanisms. This is overly restrictive for some
applications: there must be at least four threads active (for
instance) to fully utilize the pipeline, and a single thread cannot
be scheduled more frequently than once every four cycles. By
allowing an arbitrary interleaving, FlexPRET allows a trade-off
between overall throughput and single-thread latency. If
a deadline needs to be met, a thread can be scheduled more
frequently, but can waste more cycles preventing hazards.


C.  Thread  Scheduling 

The  pipeline  supports  an  arbitrary  interleaving  of  threads 
by  using  knowledge  of  instructions  in  the  pipeline  to  forward 
or stall to prevent data or control hazards. Without any restrictions
on thread scheduling, however, it is difficult to predict
how  many  thread  cycles  (cycles  a  thread  is  scheduled)  it  would 
require  for  a  thread  to  execute  a  sequence  of  instructions 
because  pipeline  spacing  between  them  can  vary  unpredictably. 

Definition  1:  If  the  scheduling  frequency  of  a  thread  is 
1/X,  it  can  use  a  given  stage  of  the  pipeline  exactly  once 
every X processor cycles. For example, if thread T0 has a
scheduling frequency of 1/2 and T is a different thread, then
a resulting schedule could be: T0 T T0 T T0 T ...

A  constant  scheduling  frequency  is  useful  for  WCET 
analysis  because  the  thread  cycle  cost  of  each  instruction  is 
constant  and  known  for  each  scheduling  frequency.  A  branch- 
taken or jump instruction, for example, requires three, two,
or one thread cycles for scheduling frequencies of 1, 1/2, or
1/3 and lower (1/3, 1/4, ...), respectively.
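
These costs follow from the rule in Section III-B that the two processor cycles after a taken branch or jump must not execute an instruction from the same thread. The sketch below is only an illustration under that rule; the function name `branch_taken_thread_cycles` and the parameter `spacing` (the X of a 1/X scheduling frequency) are hypothetical and are not part of FlexPRET.

```c
/* Thread cycles consumed by a taken branch or jump when the thread is
 * scheduled once every `spacing` processor cycles (frequency 1/spacing).
 * Assumption: any slot of this thread that falls within the two processor
 * cycles after the branch is flushed and therefore wasted. */
unsigned branch_taken_thread_cycles(unsigned spacing)
{
    const unsigned flush_window = 2;  /* processor cycles after the branch */
    unsigned flushed;

    if (spacing == 0)
        return 0;                     /* invalid: spacing must be at least 1 */

    /* Own slots occur at offsets spacing, 2*spacing, ... after the branch;
     * count how many of them land inside the flush window. */
    flushed = flush_window / spacing;
    return 1u + flushed;              /* the branch itself plus flushed slots */
}
```

For spacings of 1, 2, and 3 the function returns 3, 2, and 1 thread cycles, matching the values stated above; any larger spacing also costs a single thread cycle.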

To be useful for mixed-criticality systems, hard real-time
threads (HRTTs) are guaranteed constant scheduling frequencies,
and soft real-time threads (SRTTs) efficiently use available
cycles by not providing this guarantee. To implement this,
both HRTTs and SRTTs can either be active or sleeping. If
active, the thread is allowed to be scheduled. If sleeping, the
thread is not allowed to be scheduled and thus will not consume
any processor cycles. A thread is put in this mode using
the TS (thread sleep) instruction. A thread can be activated by
an interrupt mechanism, such as a timing instruction or external
I/O, allowing rapid, event-driven responses without wasting
cycles polling. An active HRTT is only scheduled at prescribed
cycles to maintain a constant scheduling frequency. In addition
to being potentially scheduled at prescribed cycles, an active
SRTT will share available cycles between all other active
SRTTs in a round-robin fashion. Available cycles themselves
can be prescribed for SRTTs or occur when an SRTT or HRTT
is sleeping.

Example 4: Consider a system with active HRTT T0,
sleeping HRTT T1, active SRTTs T2 and T3, and sleeping
SRTT T4. If T0 has a scheduling frequency of 1/4, and T1
has a scheduling frequency of 1/2, then a resulting schedule
could be as follows:

T0 T2 T3 T2 T0 T3 T2 T3 ...

T0 is the only active HRTT and is scheduled every fourth
cycle to satisfy its scheduling frequency. The rest of the cycles
are shared round-robin between active SRTTs T2 and T3.

Example 5: Consider the same system as Example 4, but
HRTT T1 and SRTT T4 have been activated. A resulting
schedule could be as follows:

T0 T1 T2 T1 T0 T1 T3 T1 T0 T1 T4 T1 ...

Now  only  one  out  of  every  four  cycles  is  used  for  SRTTs. 
An  SRTT  can  be  scheduled  like  an  HRTT  to  guarantee  a 
minimum  scheduling  frequency,  but  unlike  an  HRTT,  would 
also  share  available  cycles  with  other  SRTTs. 

In addition to being able to provide scheduling frequency
guarantees to threads and utilizing unused cycles, this scheduling
technique is useful for applications with varying concurrency
and deadlines. Although the thread scheduling could
be statically set at boot and never changed, it could also be
changed dynamically during runtime. For example, an HRTT’s
scheduling frequency could be increased to meet a deadline it
would otherwise miss, or a platform could switch modes of
operation.

The thread scheduler uses two control registers to
implement this technique: the slots control register prescribes
cycles to certain threads and the thread mode control register
stores whether a thread is active or sleeping. A thread requires
supervisory mode to modify the slots control register or the
thread mode of a different thread. The slots control register
provides 8 slots, where each slot can have one of the following
values: D (the slot is disabled), S (the slot is used for SRTTs),
or T0-T7 (the slot is dedicated to that thread ID and only
used for SRTTs if the thread is sleeping; see footnote 2). To
decide which thread to schedule next, non-disabled slots are
cycled through in a round-robin fashion, either using the
specified thread ID if active or delegating to SRTT round-robin.
It is the responsibility of the programmer or compiler to then
assign the slots control register with HRTT IDs such that each
HRTT has a constant scheduling frequency.

While  the  slots  control  register  guarantees  cycles  to  certain 
threads,  the  thread  mode  register  specifies  the  mode  of  each 
thread  and  is  used  to  schedule  SRTTs.  A  thread  can  be  in 
one  of  four  modes:  active  HRTT  (HA),  sleeping  HRTT  (HZ), 
active  SRTT  (SA),  or  sleeping  SRTT  (SZ).  A  disabled  thread 
can  just  be  set  to  an  HRTT  mode  and  not  be  specified  in  the 
slots  control  register.  When  a  cycle  is  delegated  to  SRTTs  (its 
value  is  S  or  the  specified  thread  is  sleeping),  the  next  SRTT 
in  a  round-robin  rotation  of  active  SRTTs  is  selected. 
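
The selection rule described above can be sketched as a small software model. This is only an illustration of the described behavior, not the hardware implementation: the type and function names (`scheduler`, `next_thread`, `pick_srtt`), the encodings `SLOT_D` and `SLOT_S`, and the choice to begin the SRTT rotation at the lowest thread ID are all assumptions made for this sketch.

```c
#define NUM_SLOTS   8
#define NUM_THREADS 8
#define SLOT_D (-1)   /* slot disabled (hypothetical encoding) */
#define SLOT_S (-2)   /* slot delegated to SRTTs (hypothetical encoding) */

typedef enum { HA, HZ, SA, SZ } thread_mode;

typedef struct {
    int         slots[NUM_SLOTS];    /* SLOT_D, SLOT_S, or a thread ID */
    thread_mode mode[NUM_THREADS];   /* mode of each thread */
    int         next_slot;           /* index of the next slot to examine */
    int         last_srtt;           /* SRTT served most recently */
} scheduler;

/* Next active SRTT in round-robin order, or -1 if none is active. */
static int pick_srtt(scheduler *s)
{
    for (int i = 1; i <= NUM_THREADS; i++) {
        int t = (s->last_srtt + i) % NUM_THREADS;
        if (s->mode[t] == SA) {
            s->last_srtt = t;
            return t;
        }
    }
    return -1;
}

/* Thread ID to fetch from this cycle, or -1 if nothing can be scheduled. */
int next_thread(scheduler *s)
{
    for (int tried = 0; tried < NUM_SLOTS; tried++) {
        int slot = s->slots[s->next_slot];
        s->next_slot = (s->next_slot + 1) % NUM_SLOTS;
        if (slot == SLOT_D)
            continue;                 /* disabled slots are skipped */
        if (slot >= 0 && (s->mode[slot] == HA || s->mode[slot] == SA))
            return slot;              /* the slot's owner is active */
        return pick_srtt(s);          /* S slot, or the owner is sleeping */
    }
    return -1;                        /* every slot is disabled */
}
```

With slots set to T0, T1, S, T1, D, D, D, D, thread modes HA, HZ, SA, SA, SZ for T0 to T4 (unused threads set to HZ), `next_slot` set to 0, and `last_srtt` set to `NUM_THREADS - 1`, repeated calls to `next_thread` return 0, 2, 3, 2, 0, 3, 2, 3, which is the schedule of Example 4. Changing T1 to HA and T4 to SA gives 0, 1, 2, 1, 0, 1, 3, 1, 0, 1, 4, 1, as in Example 5.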

The layout of the slots control register is shown in Table Ia,
with a possible assignment that implements the schedules in
Examples 4 and 5 in parentheses; the actual schedules are
only different because of different thread modes. If the first
two slots (assigned to T0 and T1) were to be swapped, T1 would
no longer have a constant scheduling frequency because the
spacing between instructions from that thread would no longer
be constant (T1 T0 S T1 T1 T0 S T1 ...). The layout of the thread
mode register is shown in Table Ib, with values corresponding to
Example 4 in parentheses. Hardware can also modify the mode
of a thread, allowing interrupts to wake a thread from sleep.
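
Because the programmer or compiler is responsible for choosing a slot assignment that gives each HRTT a constant scheduling frequency, a simple check can be useful. The following is a minimal sketch, assuming the slots register is modeled as an array of 8 integers; the function name `has_constant_frequency` and the encodings `SLOT_D` and `SLOT_S` are hypothetical.

```c
#include <stdbool.h>

#define NUM_SLOTS 8
#define SLOT_D (-1)   /* slot disabled (hypothetical encoding) */
#define SLOT_S (-2)   /* slot delegated to SRTTs (hypothetical encoding) */

/* Returns true if thread `tid` owns at least one slot and its slots are
 * equally spaced when the enabled slots are repeated cyclically. */
bool has_constant_frequency(const int slots[NUM_SLOTS], int tid)
{
    int pos[NUM_SLOTS];   /* positions of tid among the enabled slots */
    int enabled = 0;      /* number of non-disabled slots */
    int count = 0;        /* number of slots owned by tid */

    for (int i = 0; i < NUM_SLOTS; i++) {
        if (slots[i] == SLOT_D)
            continue;                 /* disabled slots take no cycle */
        if (slots[i] == tid)
            pos[count++] = enabled;
        enabled++;
    }
    if (count == 0)
        return false;                 /* no dedicated slot, no guarantee */

    /* Gap from the last owned slot around to the first one. */
    int gap = pos[0] + enabled - pos[count - 1];
    for (int i = 1; i < count; i++)
        if (pos[i] - pos[i - 1] != gap)
            return false;             /* uneven spacing */
    return true;
}
```

For the assignment T0 T1 S T1 D D D D the check passes for both T0 and T1. After swapping the first two slots (T1 T0 S T1 D D D D) it still passes for T0 but fails for T1, whose slots are then separated by gaps of 3 and 1.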


Footnote 2: This is the case for a configuration with 8 threads. A
32-bit register can support up to 14 unique thread IDs.
TABLE I: Thread scheduling is set by two control registers.


D.  Timing  Instructions 

New  timing  instructions  augment  the  RISC-V  ISA  for 
expressing  real-time  semantics.  In  contrast  to  previous  PRET 
architectures  supporting  timing  instructions  [22],  [19],  [15], 
our design is targeted at mixed-criticality systems.

The  FlexPRET  processor  contains  an  internal  clock  that 
counts  the  number  of  elapsed  nanoseconds  since  the  processor 
was  booted.  The  current  time  is  stored  in  a  64-bit  register, 
meaning  that  the  processor  can  be  active  for  584  years  without 
the clock counter wrapping around. Two new instructions can
be used to get the current time: get time high (GTH r1) and
get time low (GTL r2) store the higher and lower 32 bits in
registers r1 and r2, respectively. When GTL is executed, the
processor internally stores the higher 32 bits of the clock
and then returns this stored value when executing GTH. As
a consequence, executing GTL followed by GTH is atomic, as
long as the instruction order is preserved.
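
The latching behavior can be illustrated with a small software model. This is a sketch of the semantics described above rather than the hardware design; the names `clock_model`, `model_gtl`, `model_gth`, and `read_time_ns` are hypothetical.

```c
#include <stdint.h>

/* Software model of the 64-bit nanosecond clock and its read latch. */
typedef struct {
    uint64_t now_ns;       /* free-running time since boot */
    uint32_t latched_hi;   /* upper 32 bits captured by the last GTL */
} clock_model;

/* Models GTL: return the lower 32 bits and latch the upper 32 bits. */
uint32_t model_gtl(clock_model *c)
{
    c->latched_hi = (uint32_t)(c->now_ns >> 32);
    return (uint32_t)c->now_ns;
}

/* Models GTH: return the upper 32 bits latched by the preceding GTL. */
uint32_t model_gth(const clock_model *c)
{
    return c->latched_hi;
}

/* Reads a consistent 64-bit timestamp: GTL first, then GTH. */
uint64_t read_time_ns(clock_model *c)
{
    uint32_t lo = model_gtl(c);
    uint32_t hi = model_gth(c);
    return ((uint64_t)hi << 32) | lo;
}
```

If the clock advances from 0x1FFFFFFFF to 0x200000000 between the two reads, the latched upper half is still 1, so the combined value is 0x1FFFFFFFF rather than the inconsistent 0x2FFFFFFFF that an unlatched read of the upper half would produce.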

To provide a lower bound on the execution time for a
code fragment, the RISC-V ISA is extended with a delay until
instruction DU r1, r2, where r1 is the higher 32 bits and r2
is the lower 32 bits of an absolute time value. Semantically,
the thread is delayed (replays this instruction) until the current
time becomes greater than or equal to the time value specified
by r1 and r2. However, in contrast to previous processors
supporting timing instructions (e.g., PTARM [15], [19]), the clock
cycles are not wasted, but can instead be utilized by SRTTs.
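
Since the absolute time is passed as two 32-bit halves, software that computes a deadline must carry from the lower half into the upper half. The sketch below shows that arithmetic and the comparison that ends the delay; the helper names `add_ns` and `deadline_reached` are hypothetical and only model the described semantics.

```c
#include <stdbool.h>
#include <stdint.h>

/* Adds delta_ns to a 64-bit time held as two 32-bit halves, carrying
 * from the lower half into the upper half. */
void add_ns(uint32_t *hi, uint32_t *lo, uint64_t delta_ns)
{
    uint64_t t = ((uint64_t)*hi << 32) | *lo;
    t += delta_ns;
    *hi = (uint32_t)(t >> 32);
    *lo = (uint32_t)t;
}

/* Condition under which delay until stops replaying: the current time is
 * greater than or equal to the time value formed by hi and lo. */
bool deadline_reached(uint64_t now_ns, uint32_t hi, uint32_t lo)
{
    uint64_t deadline = ((uint64_t)hi << 32) | lo;
    return now_ns >= deadline;
}
```

For example, adding 10 ms corresponds to `add_ns(&hi, &lo, 10000000)`, and adding 0x20 ns to the time 0x00000000FFFFFFF0 yields an upper half of 1 and a lower half of 0x10.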

If  the  task  is  firm  (there  is  no  utility  of  continuing  after  a 
missed  deadline),  it  should  be  interrupted  when  the  deadline 
is missed, providing an upper bound. The exception on expire
instruction, EE r1, r2, enables a timer exception that is executed
when the current time exceeds the value in r1, r2. The jump address is
specified  by  setting  a  control  register  with  MTPCR  (move  to 
program  control  register).  Only  one  exception  per  thread  can 
be  active  at  the  same  point  in  time;  nested  exceptions  must  be 
implemented  in  software.  The  instruction  deactivate  exception 
on  expire  DE  deactivates  the  timer  exception. 

Exception  on  expire  can  be  used  in  different  ways.  Besides 
implementing  an  upper  bound  for  firm  tasks  or  a  preemptive 
scheduler,  it  can  also  be  used  as  a  timer  to  activate  a  thread 
at  a  certain  point  in  time  in  the  future.  By  first  issuing  an 
exception  on  expire  and  then  executing  a  new  thread  sleep 
TS  instruction,  the  clock  cycles  for  the  sleeping  thread  can  be 
utilized by other active SRTTs. Another use of exception on
expire is for anytime algorithms, that is, algorithms that can
be interrupted at any point in time and that return a better
solution the longer they are executed.

E.  Memory  Hierarchy 

For spatial isolation between threads, FlexPRET allows
threads to read anywhere in memory, but only write to certain
regions. The regions are specified by control registers that can
only be set by a thread in supervisory mode with MTPCR.
Virtual memory is a standard and suitable approach, but
FlexPRET currently uses a different scheme for simplicity. There
is one control register for the upper address of a shared region
(which starts at the bottom of data memory) and two control
registers per thread for the lower and upper addresses of a
thread-specific region. Memory is divided into 1 kB regions,
and a write only succeeds if the address is within the shared or
thread-specific region. By specifying all thread-specific regions
and the shared region to be disjoint, each thread will have both
private memory and access to shared memory.
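
The write check can be sketched as follows. The text does not specify how the bounds are encoded, so this sketch assumes that each control register holds a 1 kB region index and that the bounds are inclusive; the names `write_bounds` and `write_allowed` are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define REGION_SHIFT 10u   /* 1 kB regions: region index = address >> 10 */

/* Write bounds seen by one thread (assumed encoding: region indices,
 * inclusive bounds; the shared area starts at region 0). */
typedef struct {
    uint32_t shared_upper;   /* last region of the shared area */
    uint32_t thread_lower;   /* first region of the thread-specific area */
    uint32_t thread_upper;   /* last region of the thread-specific area */
} write_bounds;

/* True if a write by this thread to data address `addr` would succeed. */
bool write_allowed(const write_bounds *b, uint32_t addr)
{
    uint32_t region = addr >> REGION_SHIFT;
    bool in_shared = region <= b->shared_upper;
    bool in_thread = region >= b->thread_lower && region <= b->thread_upper;
    return in_shared || in_thread;
}
```

With a shared area covering region 0 and a thread-specific area covering regions 2 and 3, writes to addresses 0x000 to 0x3FF and 0x800 to 0xFFF succeed under these assumptions, while a write to 0x400 or 0x1000 is rejected.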

For timing predictability, FlexPRET uses scratchpad
memories [23]. These are local memories that, in contrast to caches,
have a separate address space from main memory and are
explicitly controlled by software; all valid memory accesses
always succeed and are single cycle. With scratchpad memories,
WCET analysis only needs to determine which address space a
memory operation is using, but not the current state of a cache,
a benefit for fine-grained predictability. Instructions are stored
in instruction scratchpad memory (I-SPM) and data is stored
separately in data scratchpad memory (D-SPM). Scratchpad
memories are not required; caches could be used instead if
the reduction in fine-grained predictability is acceptable. We
envision a hybrid approach where HRTTs use scratchpads
and SRTTs use caches for future versions of FlexPRET.

F.  Programming,  Compilation,  and  Timing  Analysis 

FlexPRET can be programmed using a low-level programming
language, such as C, augmented with constructs
for expressing temporal semantics. The proposed hardware
architecture can be an integral part of a precision timed
infrastructure [24] that includes languages and compilers with
a ubiquitous notion of time. Such a complete infrastructure
with timing-aware compilers is outside the scope of this paper;
instead, we use a RISC-V port of the gcc compiler and
implement the new timing instructions using inline assembly.
The following code fragment illustrates how a simple periodic
control loop can be implemented.

1  int h,l;                    // High and low 32-bit values
2  get_time(h,l);              // Current time in nanoseconds
3  while(1) {                  // Repeat control loop forever
4    add_ms(h,l,10);           // Add 10 milliseconds
5    exception_on_expire(h,l,missed_deadline_handler);
6    compute_task();           // Sense, compute, and actuate
7    delay_until(h,l);         // Delay until next period
8  }

Before the control loop is executed, the current time (in
nanoseconds) is stored in variables h and l (line 2). The
time is incremented by 10ms (line 4) and a timer exception
is enabled (line 5), followed by task execution
(line 6). If a deadline is missed, the exception handler
missed_deadline_handler is called. To force a lower
bound on the timing loop, execution is delayed until
the time period has elapsed (line 7); the cycles during the
delay can be used by an active SRTT. Functions get_time,
exception_on_expire, and delay_until implement
the new RISC-V timing instructions using inline assembly.
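
The time arithmetic on line 4 of the fragment can be illustrated with a small sketch. The helper below is hypothetical: the name `add_ms_split`, the pointer-based signature, and the use of unsigned 32-bit halves are assumptions made so the example compiles on any host, and it is not the paper's implementation of add_ms.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: add `ms` milliseconds to a 64-bit nanosecond
 * timestamp held as two 32-bit halves (*h is the high word, *l the low). */
void add_ms_split(uint32_t *h, uint32_t *l, uint32_t ms)
{
    assert(h != NULL && l != NULL);
    uint64_t t = ((uint64_t)*h << 32) | *l;  /* join the two halves */
    t += (uint64_t)ms * 1000000u;            /* 1 ms = 1,000,000 ns */
    *h = (uint32_t)(t >> 32);                /* split again; any carry */
    *l = (uint32_t)t;                        /* lands in the high word */
}
```

Working on the joined 64-bit value keeps the carry from the low word into the high word implicit, which is the main pitfall when a timestamp is split across two registers.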

To have full control over timing, real-time applications can
be implemented as bare-metal software, using only lightweight
libraries for hardware interaction. As a scheduling design
methodology, we propose that tasks with the highest criticality
level, A, are assigned individual HRTTs, thus providing both
temporal and spatial isolation. Criticality level B tasks also
use HRTTs, but several tasks can share the same thread, thus
lowering the hardware-enforced isolation. Low-criticality tasks
can then share SRTTs using standard scheduling algorithms,
such as rate-monotonic scheduling and EDF [25]. In
the evaluation section (Section IV-D), we apply this scheduling
methodology to a mixed-criticality avionics example.

For high-criticality tasks it is typically required that the
upper bound on execution time is guaranteed statically at compile
time. Well-established worst-case execution time (WCET) analysis
techniques can be applied to compute safe upper bounds on tasks. In
particular, HRTTs possess fine-grained timing predictability
(at the instruction level), making hardware timing analysis
[26] especially simple. Although no timing analysis tools
currently exist for the RISC-V ISA, we contend that standard
WCET computation techniques [17] and state-of-the-art industrial
WCET tools, such as AbsInt, can easily be adapted to
compute WCET estimates for HRTTs on FlexPRET. Timing
analysis for SRTTs is, however, inherently harder because of
cycle stealing. Instead, we propose to use measurement-based
approaches for low-criticality tasks. Fortunately, measurement
of time is particularly simple in the proposed architecture; the
ISA timing instructions can be used to give precise measurements
with minimal overhead.

IV.  Evaluation 

We implemented and deployed FlexPRET as a soft-core on
an FPGA to demonstrate feasibility and provide quantitative
resource costs of the proposed architectural techniques. We also
simulated two mixed-criticality task sets running on FlexPRET.
The first is simple and shows how FlexPRET provides isolation
to HRTTs and efficient utilization with SRTTs, even when an
error occurs on an SRTT. The second is more complex (21
tasks on 8 threads) and requires single threads to use software-based
scheduling to execute multiple tasks. We explain the
methodology used and discuss the implications.

A.  Implementation 

We implemented FlexPRET in Chisel [27], a hardware
construction language that generates both Verilog code and a
cycle-accurate C++-based simulator. Chisel allows us to easily
parameterize the code to produce various configurations for
both FlexPRET and two baseline processors for comparison.
We implement the two baseline processors instead of comparing
to existing processors to remove the influence of differing
ISAs and optimization techniques. The intent is to show and
discuss the incremental costs of flexible thread scheduling and
timing instructions, as these techniques are not restricted in use
to a 5-stage RISC-V processor. The FlexPRET processor may
be referred to as FlexPRET-4T, where the number represents
the number of physical threads available to be used.

The first baseline processor will be referred to as Base-1T
and is a single-threaded 5-stage RISC-V processor with
scratchpad memories, predict not-taken branching, and
forwarding, stalling, or flushing to resolve data and control
hazards. This represents a simple approach to achieving
fine-grained predictability, and would require software-based
scheduling to execute mixed-criticality workloads. This
processor functions identically to FlexPRET when FlexPRET’s
scheduler executes the same thread every cycle.

The second baseline processor will be referred to as Base-4T-RR
and is a fine-grained multithreaded 5-stage RISC-V
processor with scratchpad memories, much like PTARM [15] and
XMOS [16]. It interleaves a fixed four threads in the pipeline,
and consequently does not require any forwarding, stalling,
or flushing. However, all four threads must be active to fully
utilize the processor. This processor functions identically to
FlexPRET when FlexPRET’s scheduler executes four threads
as HRTTs in a round-robin pattern.

In all processors, a subset of the RISC-V ISA is implemented,
currently excluding floating point arithmetic instructions,
atomic memory operations, integer division, and support
for packed instructions. These instructions are not needed
to demonstrate the main ideas. To verify correct implementation
of both FlexPRET and the two baseline processors,
an assembly test suite and more than ten C benchmarks
from the Mälardalen [28] benchmark suite were run for all
configurations on both the cycle-accurate simulator and an
FPGA implementation of the Verilog code, with results being
examined for correctness.

B.  FPGA  Resources 

Several  configurations  of  FlexPRET  and  the  baseline 
processors  were  deployed  on  an  FPGA  (Xilinx  Virtex-5 
XC5VLX110T)  to  evaluate  the  area  cost  associated  with 
different  features.  The  processors  were  all  clocked  at  80MHz 
with  a  16kB  I-SPM  and  a  16kB  D-SPM  in  block  RAM.  A 
block  RAM  has  a  fixed  size  that  happens  to  be  large  enough 
for  all  the  register  files  for  eight  threads,  so  a  single  dedicated 
block  RAM  is  used  for  all  register  files  in  each  configuration. 

The  flip-flop  (FF)  and  lookup-table  (LUT)  usages  are 
shown  in  Figure  2,  where  TI  implies  the  processor  implements 
timing  instructions  as  described  in  Section  III-D.  Although  the 
percentage  increase  may  appear  large  in  some  cases,  these 
numbers  are  from  bare-minimum  implementations  and  the 
absolute  cost  is  relatively  low.  As  a  processor’s  area  increases 
with  more  complex  functionality,  such  as  supporting  integer 
division,  a  floating-point  unit,  different  memory  hierarchy,  or 
more  peripherals,  the  relative  percentage  cost  drops. 

The resource difference between Base-1T and Base-4T-RR
shows the cost of fine-grained multithreading, with a 5%
increase in LUTs and a 42% increase in FFs. Although Base-4T-RR
removes forwarding paths and control logic for stalling
and flushing, it requires more multiplexing based on thread
IDs. There is also state that must be stored for each thread,
such as the program counter and some control registers.


The resource difference between Base-4T-RR and
FlexPRET-4T shows the cost of adding flexible thread
scheduling. The 35% increase in LUTs and 10% increase in
FFs are caused by the thread scheduler, forwarding paths, and
additional control logic for stalling and flushing. The
eight-thread versions, Base-8T-RR and FlexPRET-8T, are useful for
applications where more isolated threads are desired, but
come at the cost of additional area.

It is important to notice that these modifications do not
add computational power; they only allow it to be more efficiently
utilized for mixed-criticality task sets where guarantees must
be made. As a consequence, while a multicore system of
Base-1T cores would provide more computational throughput
per area for many soft real-time task sets, the same may not
hold for mixed-criticality task sets. For a mixed-criticality task
set where hardware-based isolation guarantees are the controlling
constraint, fewer FlexPRET processors could provide
the same functionality as many more traditional processors,
a substantial area and power savings. For example, if four tasks
that require hardware-based isolation can execute on a single
FlexPRET-4T, then four Base-1T processors are not needed.

At first glance, adding the timing instructions described
in Section III-D to FlexPRET-4T looks costly, with a 79%
increase in FFs and a 10% increase in LUTs. The main source
is supporting the delay until and exception on expire instructions,
where two 64-bit expiration times need to be stored for each
thread and compared to the current time every cycle. This
additional cost could be reduced by roughly half if absolute
time is reduced from 64 bits to 32 bits, but rollover becomes
an issue if any time interval is on the order of seconds.
In practice, FlexPRET could use fewer bits for time and use
software to handle larger time intervals to reduce area, but we
showed the 64-bit version because it is the worst case for area.
On most microcontrollers, timers are implemented outside the
processor, and similar area would be required to achieve the
same precision and flexibility.
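
The rollover concern can be made concrete with a short calculation. The sketch below assumes a timer that counts in 1 ns ticks, as the nanosecond timestamps in the code fragment of Section III-F suggest; the function name `rollover_seconds` is hypothetical.

```c
#include <assert.h>

/* Time in seconds until a `bits`-wide counter of 1 ns ticks wraps around.
 * The 1 ns tick is an assumption taken from the nanosecond timestamps
 * used by the timing instructions. */
double rollover_seconds(unsigned bits)
{
    assert(bits >= 1 && bits <= 64);
    double ticks = 1.0;
    for (unsigned i = 0; i < bits; i++)
        ticks *= 2.0;        /* 2^bits; powers of two are exact in double */
    return ticks / 1e9;      /* nanoseconds to seconds */
}
```

Under that assumption a 32-bit counter wraps after roughly 4.3 seconds, which is why intervals on the order of seconds are problematic, whereas a 64-bit counter wraps only after several centuries.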


C. Demonstration of Hardware-Based Isolation and Efficient
Resource Utilization

A simple mixed-criticality example with four periodic
tasks, $\tau_A$, $\tau_B$, $\tau_C$, and $\tau_D$, was simulated on FlexPRET
as a more concrete demonstration of FlexPRET providing
hardware-based isolation and efficient resource utilization.
Although the FPGA implementation is useful for evaluating
correctness, feasibility, and area costs, our cycle-accurate
simulator is useful for varying configurations and monitoring the
timing behavior of each task. This example simulates tasks
running on a 100MHz FlexPRET-4T with a 32kB I-SPM and
32kB D-SPM, a configuration that would not be unreasonable
for a soft-core or ASIC implementation.

The task identifiers A, B, C, D also correspond to criticality
levels from highest to lowest, with A and B considered
hard real-time tasks and C and D soft real-time tasks. Each
task executes on its own thread, so each thread may be referred
to by the task it executes. Each task’s thread ID, initial thread
mode, period ($T_i$), deadline ($D_i$), and worst-case execution
cycles for each scheduling frequency ($E_{i,1}$, $E_{i,1/2}$, $E_{i,1/3+}$)
are shown in Table IIa.


Task       Thread ID   Thread Mode   $T_i, D_i$ (ms)   $E_{i,1}$ ($\times 10^5$)   $E_{i,1/2}$ ($\times 10^5$)   $E_{i,1/3+}$ ($\times 10^5$)
$\tau_A$   0           HA            12                4.00                        3.65                          3.45
$\tau_B$   1           HA            6                 0.50                        0.45                          0.43
$\tau_C$   2           SA            12                4.80                        4.69                          4.59
$\tau_D$   3           SA            6                 1.00                        0.93                          0.86

(a) The task set

D | D | S | S | 0 ($\tau_A$) | 2 ($\tau_C$) | 1 ($\tau_B$) | 0 ($\tau_A$)

(b) The slots control register

TABLE II: A simple mixed-criticality example


The WCET $C_i$ depends on the worst-case execution cycles
$E_{i,x}$ and the frequency of the thread $f \cdot x$ ($f = 100$ MHz
is the frequency of the processor), where $C_i = E_{i,x}/(f \cdot x)$.
For example, if $\tau_A$ executes every processor cycle ($x = 1$), it
would take $4 \times 10^5/(100 \times 10^6\,\text{Hz}) = 4$ ms to complete, and if
$\tau_A$ executes every other processor cycle ($x = 1/2$), it would
take $3.65 \times 10^5/(100 \times 10^6\,\text{Hz} \times 1/2) = 7.3$ ms to complete.
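
The WCET relation above can be checked numerically with a few lines of C. This is a minimal sketch; the function name `wcet_ms` and the choice of milliseconds as the return unit are illustrative assumptions.

```c
#include <assert.h>

/* WCET in milliseconds for a task needing `cycles` thread cycles when the
 * processor runs at f_hz and the thread is scheduled with frequency x
 * (x = 1 means every processor cycle, x = 0.5 every other cycle). */
double wcet_ms(double cycles, double f_hz, double x)
{
    assert(f_hz > 0.0 && x > 0.0 && x <= 1.0);
    return cycles / (f_hz * x) * 1e3;   /* seconds to milliseconds */
}
```

Calling `wcet_ms(4.00e5, 100e6, 1.0)` and `wcet_ms(3.65e5, 100e6, 0.5)` reproduces the 4 ms and 7.3 ms figures from the two examples in the text.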

One of the properties of FlexPRET is that the number of cycles
required to complete a task can depend on the scheduling
frequency, as more cycles can be wasted at higher scheduling
frequencies to prevent hazards. To account for this variation
in execution cycles, each task iterates a program from
the Mälardalen benchmark suite a fixed number of times
to reach the value of $E_{i,1}$ specified in Table IIa, and the
values of $E_{i,1/2}$ and $E_{i,1/3+}$ are then measured. The values
of $E_{i,1}$ are contrived to highlight properties. Tasks $\tau_A$ and
$\tau_B$ use statemate (generated automotive controller), $\tau_C$ uses
jfdctint (discrete-cosine transformation), and $\tau_D$ uses insertsort
(sorting algorithm). Inputs are always the same, so each job
(iteration of the task) takes the same number of cycles. Periodic
release of each task is simple: after the task completes, a delay
until instruction prevents the thread from being scheduled until
the next period starts.

The schedule set by the slots control register is shown in
Table IIb. Task $\tau_A$ executes every 3rd processor cycle, $\tau_B$ every
6th processor cycle, and $\tau_C$ every 6th processor cycle. Recall that
if a task does not need to use the cycle or the slot is marked as
S, that cycle is used by active SRTTs in a round-robin fashion,
potentially $\tau_C$ and $\tau_D$ in this case. Even though $\tau_C$ is a soft
real-time task, it is given a scheduling slot in order to execute
more frequently than $\tau_D$ because it requires higher processor
utilization.
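
The scheduling frequencies quoted above follow from counting slots. The sketch below assumes that slots marked D are disabled and skipped by the scheduler, which is consistent with the stated frequencies; the encodings `SLOT_D` and `SLOT_S` and the function name `slot_fraction` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed encodings for non-thread slot values (illustrative only). */
enum { SLOT_D = -1,   /* disabled slot, skipped by the scheduler */
       SLOT_S = -2 }; /* slot shared by active SRTTs */

/* Fraction of processor cycles that the slots register allocates to
 * `value`, which is either a thread ID or SLOT_S. */
double slot_fraction(const int *slots, size_t n, int value)
{
    assert(slots != NULL && n > 0);
    size_t enabled = 0, matching = 0;
    for (size_t i = 0; i < n; i++) {
        if (slots[i] == SLOT_D)
            continue;                 /* disabled slots take no cycles */
        enabled++;
        if (slots[i] == value)
            matching++;
    }
    assert(enabled > 0);
    return (double)matching / (double)enabled;
}
```

For the register in Table IIb, thread 0 owns two of the six enabled slots (every 3rd cycle), threads 1 and 2 own one each (every 6th cycle), and the remaining two slots go to SRTTs.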

Figure 3 shows execution traces for a single hyperperiod
($t = 12$ ms) for different situations on FlexPRET-4T. The horizontal
axis is time and the vertical axis is one iteration through the
slots control register, with the value of each slot in parentheses.
Figure 3a shows normal operation, where all tasks meet their
deadlines. The first job of both $\tau_A$ and $\tau_B$ finishes before
the deadline and the remaining allocated cycles are shared by
active SRTTs, which means $\tau_C$ and $\tau_D$ until $\tau_D$ completes.
Note that this task set would not even be schedulable on
Base-4T-RR because $\tau_A$ and $\tau_C$ each require more than 1/4 of
the processor cycles to meet their deadlines, which cannot be
provided in four-thread round-robin.
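
The claim about Base-4T-RR can be checked against Table IIa. The sketch below assumes that each deadline equals the period and that the $E_{i,1/3+}$ column applies at a fixed 1/4 scheduling frequency; the function name `required_fraction` is illustrative.

```c
#include <assert.h>

/* Fraction of processor cycles a task needs in order to finish `cycles`
 * thread cycles within one period of `period_ms` at clock rate f_hz. */
double required_fraction(double cycles, double period_ms, double f_hz)
{
    assert(cycles >= 0.0 && period_ms > 0.0 && f_hz > 0.0);
    double cycles_per_period = period_ms * 1e-3 * f_hz;
    return cycles / cycles_per_period;
}
```

At 100 MHz a 12 ms period contains $1.2 \times 10^6$ cycles, so $\tau_A$ needs about 29% and $\tau_C$ about 38% of the processor, both above the 25% that a four-thread round-robin can give a single thread, while $\tau_B$ and $\tau_D$ need about 7% and 14%.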

Figure 3b shows an error case where $\tau_D$ completes immediately,
and Figure 3c shows an error case where $\tau_D$ executes
infinitely. The cause of the errors is not important; these two
extremes simply demonstrate the isolation of HRTTs. Regardless of the
operation of $\tau_D$, the timing behavior of $\tau_A$ and $\tau_B$, the most
critical tasks, is identical. When $\tau_D$ finishes immediately, $\tau_C$
executes more frequently and finishes sooner than in normal
operation. When $\tau_D$ is in an infinite loop, $\tau_C$ takes slightly
longer but still meets its deadline.

(a) FlexPRET-4T in normal operation.
(b) FlexPRET-4T with error: $\tau_D$ completes immediately.
(c) FlexPRET-4T with error: $\tau_D$ executes infinitely.
Fig. 3: FlexPRET-4T demonstrating hardware-based isolation
and resource utilization efficiency.


D.  Case  Study:  Avionics  Mixed-Criticality  System 

To demonstrate a possible methodology for mapping and
scheduling a mixed-criticality task set on FlexPRET, we
simulated a task set from Vestal’s influential paper on
mixed-criticality scheduling [1]. The abstract workload contains 21
tasks and is derived from a time-partitioned avionics system
at Honeywell. We use the allocated execution time numbers,
which give a combined single-threaded processor utilization of
93%. We simulate execution time using the same method as
in the previous section; A and B criticality level tasks iterate
the statemate program, and C and D use either jfdctint or
insertsort. This example simulates tasks running on a 100MHz
FlexPRET-8T with a 128kB I-SPM and 128kB D-SPM.

Our approach is to isolate and over-allocate resources to
the hard real-time tasks (criticality levels A and B) and use
slack stealing to efficiently utilize the processor for soft
real-time tasks (criticality levels C and D), similar to the
reservation-based scheduling used in commercial RTOSs for
mixed-criticality systems [29], but with hardware-based isolation
guarantees. Each task’s thread ID, period ($T_i$), deadline ($D_i$),
and worst-case execution cycles for each scheduling frequency
($E_{i,1}$, $E_{i,1/2}$, $E_{i,1/3+}$) are shown in Table IIIa, and the slots
control register is shown in Table IIIb.


Tasks $\tau_{A1}$ through $\tau_{A4}$ each execute on their own HRTT, isolating
each task from all other tasks. Once a task completes, the delay
until instruction is used to wait until the next period release.
Each is given a scheduling frequency of 1/8 because this
is sufficient for meeting its deadline; over-allocation
is acceptable because delay until will donate cycles to other
threads. Complete hardware-based isolation and lack of
preemption simplify WCET analysis and provide the highest
level of confidence.

Due to resource constraints, tasks $\tau_{B1}$ through $\tau_{B7}$ cannot each
execute on their own HRTT, but they can still be isolated from
the A, C, and D criticality level tasks. Tasks $\tau_{B1}$ through $\tau_{B3}$ are
mapped to one HRTT and are able to use a non-preemptive
static schedule to meet their respective deadlines at scheduling
frequency 1/4; tasks $\tau_{B4}$ through $\tau_{B7}$ are mapped to another HRTT
and use preemptive rate-monotonic software scheduling to
meet their respective deadlines at scheduling frequency 1/4.
Different scheduling algorithms could be used, but we selected
the simplest ones that provided schedulability for these task
sets to provide the highest level of confidence. Both HRTTs
will donate cycles when not needed: the static schedule uses
delay until to wait for the next periodic release, and the rate-monotonic
scheduler uses thread sleep until an exception on expire occurs
to release tasks.
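
As a rough plausibility check on the mapping of the B-level tasks, the utilization of each shared HRTT can be computed from Table IIIa. The sketch assumes the $E_{i,1/3+}$ column applies at scheduling frequency 1/4 and that deadlines equal periods; the type `task_t` and the function `thread_utilization` are hypothetical names. A utilization below 1 is necessary but not sufficient for schedulability, so this does not replace the schedulability analysis of the chosen algorithm.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative task description: execution cycles and period in ms. */
typedef struct {
    double cycles;
    double period_ms;
} task_t;

/* Sum of C_i / T_i for tasks sharing one hardware thread that runs at
 * processor clock f_hz with scheduling frequency x. */
double thread_utilization(const task_t *tasks, size_t n, double f_hz, double x)
{
    assert(tasks != NULL && f_hz > 0.0 && x > 0.0 && x <= 1.0);
    double u = 0.0;
    for (size_t i = 0; i < n; i++) {
        assert(tasks[i].period_ms > 0.0);
        double wcet_ms = tasks[i].cycles / (f_hz * x) * 1e3;
        u += wcet_ms / tasks[i].period_ms;
    }
    return u;
}
```

With the table values, the thread running $\tau_{B1}$ through $\tau_{B3}$ comes out at about 0.65 and the thread running $\tau_{B4}$ through $\tau_{B7}$ at about 0.56, leaving slack that can be donated to SRTTs.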

Even though $\tau_{C1}$ and $\tau_{D1}$ through $\tau_{D9}$ could be mapped to one
SRTT, we use the remaining thread resources and split the tasks between
two SRTTs to improve thread throughput. Each SRTT uses
a preemptive EDF scheduler for simplicity, although other
scheduling algorithms could also be used. Exception on expire
is used to add new jobs to a priority queue sorted by deadline,
and the scheduler is run whenever a job finishes or jobs are
added, possibly preempting an executing job. Although the SRTTs
are not allocated a slot in the schedule, slack stealing from the
over-allocation of cycles to the HRTTs of the A and B criticality
level tasks is enough to meet all deadlines in this example.
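
The deadline-sorted queue used by the EDF scheduler can be sketched as a small sorted array. This is an illustrative data structure only, not the scheduler used in the evaluation: the names `job_t`, `edf_queue_t`, `edf_push`, `edf_peek`, and `edf_pop`, and the fixed capacity, are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_JOBS 16  /* assumed capacity for this sketch */

typedef struct {
    int      task_id;
    uint64_t deadline_ns;   /* absolute deadline */
} job_t;

typedef struct {
    job_t  jobs[MAX_JOBS];  /* kept sorted by ascending deadline */
    size_t count;
} edf_queue_t;

/* Insert a released job, keeping the queue sorted by deadline.
 * Returns 0 on success, -1 if the queue is full. */
int edf_push(edf_queue_t *q, job_t j)
{
    assert(q != NULL);
    if (q->count == MAX_JOBS)
        return -1;
    size_t i = q->count;
    while (i > 0 && q->jobs[i - 1].deadline_ns > j.deadline_ns) {
        q->jobs[i] = q->jobs[i - 1];   /* shift later deadlines back */
        i--;
    }
    q->jobs[i] = j;
    q->count++;
    return 0;
}

/* Job with the earliest deadline (the one EDF runs), or NULL if empty. */
const job_t *edf_peek(const edf_queue_t *q)
{
    assert(q != NULL);
    return q->count > 0 ? &q->jobs[0] : NULL;
}

/* Remove the head job once it has finished. */
void edf_pop(edf_queue_t *q)
{
    assert(q != NULL && q->count > 0);
    for (size_t i = 1; i < q->count; i++)
        q->jobs[i - 1] = q->jobs[i];
    q->count--;
}
```

In such a design the exception handler would call `edf_push` for each newly released job and then compare `edf_peek` with the running job; a different head means the running job is preempted.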

Figure 4 shows execution traces for a single hyperperiod
($t = 200$ ms). Each subplot is for a different thread and the
rectangles show when the jobs of each task are executing.
Whenever the job is preempted, the rectangle is a lighter shade
with a dotted line. Up arrows are release times and down
arrows are deadlines for each task. Each job of a task always
takes the same number of cycles. For tasks $\tau_{A1}$ through $\tau_{A4}$ and
$\tau_{B1}$ through $\tau_{B7}$, the threads are isolated so each job takes the same
amount of time. Notice that the execution times of tasks $\tau_{C1}$
and $\tau_{D1}$ through $\tau_{D9}$ vary; this is because the number of cycles
donated by $\tau_{A1}$ through $\tau_{A4}$ and $\tau_{B1}$ through $\tau_{B7}$ varies as well.

V.  Related  Work 

This  paper  is  most  closely  related  to  two  areas  of  research: 
timing  predictable  processors  and  software-based  scheduling 
for  mixed-criticality  systems. 

A.  Timing  Predictable  Processors 

Berg et al. [30] and Heckmann et al. [31] identified the
architectural properties that complicate WCET analysis and
proposed design principles that facilitate it. Edwards and Lee [32]
went further to argue for precision timed (PRET) machines that
incorporate time into the abstraction levels, making temporal
behavior as important as logical functionality. Researchers
have proposed many timing-predictable processors for real-time
systems, each making different trade-offs to target an
application space. The SPEAR processor by Delvai et al. [33]
is a 16-bit, 3-stage processor that has constant-time instructions
by removing caches and assuming single-path programming.
Schoeberl’s Java Optimized Processor (JOP) [34] predictably
executes Java bytecode by translating it to microcode for a simple
3-stage pipeline. To execute a program with synchronous
semantics, Andalam’s ARPRET [35] achieves predictability by
customizing an existing soft-core processor.

Task          Thread ID   Thread Mode   $T_i, D_i$ (ms)   $E_{i,1}$ ($\times 10^5$)   $E_{i,1/2}$ ($\times 10^5$)   $E_{i,1/3+}$ ($\times 10^5$)
$\tau_{A1}$   0           HA            25                1.10                        1.00                          0.95
$\tau_{A2}$   1           HA            50                1.80                        1.64                          1.55
$\tau_{A3}$   2           HA            100               2.00                        1.82                          1.72
$\tau_{A4}$   3           HA            200               5.30                        4.83                          4.56
$\tau_{B1}$   4           HA            25                1.40                        1.27                          1.20
$\tau_{B2}$   4           HA            50                3.90                        3.54                          3.34
$\tau_{B3}$   4           HA            50                2.80                        2.54                          2.40
$\tau_{B4}$   5           HA            50                1.40                        1.28                          1.21
$\tau_{B5}$   5           HA            50                3.70                        3.37                          3.19
$\tau_{B6}$   5           HA            100               1.80                        1.64                          1.55
$\tau_{B7}$   5           HA            200               8.50                        7.75                          7.32
$\tau_{C1}$   6           SA            50                1.90                        1.77                          1.63
$\tau_{D1}$   6           SA            50                5.40                        5.03                          4.65
$\tau_{D2}$   6           SA            200               2.40                        2.33                          2.28
$\tau_{D3}$   6           SA            50                1.30                        1.26                          1.23
$\tau_{D4}$   6           SA            200               1.50                        1.45                          1.42
$\tau_{D5}$   7           SA            25                2.30                        2.14                          1.98
$\tau_{D6}$   7           SA            100               4.80                        4.65                          4.30
$\tau_{D7}$   7           SA            200               13.00                       12.70                         12.44
$\tau_{D8}$   7           SA            100               0.60                        0.57                          0.56
$\tau_{D9}$   7           SA            50                2.40                        2.33                          2.28

(a) The task set

(b) The slots control register

TABLE III: A case study on an avionics mixed-criticality
system

Fig. 4: FlexPRET-8T executing a mixed-criticality avionics
case study.

PTARM [15] by Liu et al. and XMOS XI [16] are the
most closely related processors to FlexPRET. Both are
fine-grained multithreaded 5-stage RISC processors that require at
least four threads (exactly four threads for PTARM) to be
round-robin interleaved in the pipeline; cycles are wasted if
there are fewer than four active threads, and a single thread
can be executed at most once every four cycles. PTARM
is better suited for hard real-time tasks because all threads have
a constant scheduling frequency. Conversely, XMOS is better
suited for soft real-time tasks because inactive tasks can be
left out of round-robin scheduling, but scheduling frequency
depends on the maximum number of simultaneously active
threads.


B.  Software-based  Scheduling 

Software-based scheduling for mixed-criticality software is
typically either reservation-based or priority-based [2].
Reservation-based scheduling is best demonstrated by the ARINC 653
standard used in integrated modular avionics (IMA) systems [36].
Critical tasks are guaranteed segments of time, and most RTOSs
will steal cycles for other tasks if a task finishes early, as done
by Wind River’s VxWorks 653 RTOS [29].

Using priority-based preemptive scheduling for mixed-criticality
systems was first proposed by Vestal [1]. Since
then, there has been much work addressing the scheduling
theory of mixed-criticality systems, as recently summarized by
Burns [11]. Scheduling sporadic tasks was first addressed
by Baruah and Vestal [8], and Niz et al. [9] presented a
scheduling algorithm that protects high-criticality tasks from
low-criticality tasks, even if a nominal WCET is overrun.
More recently, Mollison et al. [10] proposed an approach for
multicore platforms. Although FlexPRET does not implement
priority-based scheduling in hardware, it can still be used as a
platform for these algorithms: either scheduling tasks within a
single thread or changing the thread scheduling.

VI.  Conclusions  and  Future  Work 

Hardware-based isolation requires executing each task on
a separate computational component, which could be a
processor, core, or hardware thread, and typically results in
underutilization of hardware resources. FlexPRET uses
fine-grained multithreading and flexible thread scheduling to
provide hardware-based isolation and predictability to HRTTs, but
also allows SRTTs to use any cycle not needed by an HRTT.

We consider FlexPRET a key contribution to a precision
timed infrastructure [24], where languages, compilers, and
architectures allow the specification and preservation of timing
semantics. The next steps in that direction are tool support for
formal verification of hard real-time tasks and investigating
how languages can leverage FlexPRET’s properties. From the
hardware perspective, the presented architectural techniques
could be applied to other processors, or FlexPRET could be
a core in a multicore system that uses a predictable
network-on-a-chip for communication.

This type of hardware platform presents an interesting new
scheduling problem. Not only do tasks need to be mapped
and scheduled on threads, but the logical execution frequency
of each thread can be controlled by modifying the slots
control register during run-time. How to optimally exploit this
flexibility to control resource distribution amongst SRTTs while
still maintaining hardware-based isolation for HRTTs is an
open scheduling problem.